# Predicting the Closing Price of Google Stock

# 1. Problem

Predicting the closing price of a stock is a complex problem because of several challenges. Stock prices are influenced by a multitude of factors, such as market trends, and analyzing and incorporating all of these factors accurately into a predictive model is a complex task. Market volatility makes accurate prediction even harder, and the quality and quantity of the available data limit how well any model can perform. The pursuit of solving this problem is nevertheless crucial, because accurate stock price predictions have significant implications for investors, financial institutions, and businesses: they can help investors make more informed decisions about buying, selling, or holding stocks, aiding in risk management.

# 2. Data Mining Task

In our project, we will use two data mining tasks to help predict the closing price of a stock: classification and clustering. For classification, we will train a model to classify the close price based on a set of attributes such as volume, open, high, and low. For clustering, we will partition closing prices into subsets (clusters) so that prices within a cluster are similar to each other but dissimilar to prices in other clusters, based on the attributes low, high, open, volume, adjClose, and adjHigh.
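As a small illustration of the clustering task, the sketch below uses a synthetic stand-in for the real dataset (the attribute values here are invented). It groups trading days with k-means after scaling, so that volume, which is on a much larger scale than the prices, does not dominate the distance computation.

```r
# A minimal k-means clustering sketch on invented daily price/volume data.
# Scaling puts prices and volume on comparable scales before clustering.
set.seed(42)
toy <- data.frame(
  open   = c(700, 705, 710, 1500, 1510, 1495),
  high   = c(710, 712, 715, 1520, 1525, 1500),
  low    = c(695, 700, 705, 1490, 1500, 1480),
  volume = c(1.2e6, 1.1e6, 1.3e6, 2.0e6, 2.1e6, 1.9e6)
)
km <- kmeans(scale(toy), centers = 2, nstart = 10)
print(km$cluster)  # days 1-3 and days 4-6 fall into separate clusters
```

The same call would apply to the real data frame once the chosen attribute columns are selected and scaled.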

# 3. Data

Our dataset is from the source: https://www.kaggle.com/datasets/shreenidhihipparagi/google-stock-prediction

Number of Attributes: 14

Number of objects: 1258

Attribute characteristics:

| Attribute Name | Data Type | Description |
|----------------|-----------|-------------|
| symbol | unique value | Name of the company. |
| date | date | The trading date: day, month, and year. |
| close | numeric | The final price at which the stock traded on a given trading day. |
| high | numeric | The highest price at which the stock traded during a specific trading day. |
| low | numeric | The lowest price at which the stock traded during a specific trading day. |
| open | numeric | The price of the stock at the beginning of a trading day, i.e. the price at which the first trade occurred that day. |
| volume | numeric | The total number of shares traded during a trading day; a measure of market activity and liquidity for the stock. |
| adjClose | numeric | The closing price adjusted for corporate actions such as dividends, stock splits, or other events that affect the stock price. |
| adjHigh | numeric | The highest price during a trading day, adjusted for any corporate actions. |
| adjLow | numeric | The lowest price during a trading day, adjusted for any corporate actions. |
| adjOpen | numeric | The opening price at the beginning of a trading day, adjusted for any corporate actions. |
| adjVolume | numeric | The trading volume adjusted for any corporate actions; this can provide a clearer picture of trading activity. |
| divCash | numeric | The amount of money paid by the company to its shareholders as a portion of its profits; dividends are typically paid on a per-share basis. |
| splitFactor | numeric | If the stock undergoes a split, the split factor indicates the ratio by which the shares were split; for instance, a 2-for-1 split means that every old share becomes 2 new shares. |

# Load necessary packages
if (!require(caret)) {
  install.packages("caret")
}
if (!require(cluster)) {
  install.packages("cluster")
}
if (!require(fpc)) {
  install.packages("fpc")
}
if (!require(ggplot2)) {
  install.packages("ggplot2")
}
library(caret)
library(cluster)
library(fpc)
library(ggplot2)
dataset = read.csv('GOOG.csv') 
View(dataset)
print(dataset)

We removed the attributes symbol, divCash, and splitFactor because each contains only a single value, so they carry no information for prediction.

Convert the date column to a date format

dataset=dataset[,2:12]
View(dataset)
dataset$date <- as.Date(dataset$date, format = "%Y-%m-%d %H:%M:%S")
print(dataset)
str(dataset)
'data.frame':   1258 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : num  718 719 710 692 694 ...
 $ high     : num  722 723 717 709 702 ...
 $ low      : num  713 717 703 688 693 ...
 $ open     : num  716 719 715 709 699 ...
 $ volume   : int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
 $ adjClose : num  718 719 710 692 694 ...
 $ adjHigh  : num  722 723 717 709 702 ...
 $ adjLow   : num  713 717 703 688 693 ...
 $ adjOpen  : num  716 719 715 709 699 ...
 $ adjVolume: int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
summary(dataset)
      date                close             high             low              open     
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671  
 1st Qu.:2017-09-12   1st Qu.: 960.8   1st Qu.: 968.8   1st Qu.: 952.2   1st Qu.: 959  
 Median :2018-12-11   Median :1132.5   Median :1143.9   Median :1117.9   Median :1131  
 Mean   :2018-12-12   Mean   :1216.3   Mean   :1227.4   Mean   :1204.2   Mean   :1215  
 3rd Qu.:2020-03-12   3rd Qu.:1360.6   3rd Qu.:1374.3   3rd Qu.:1348.6   3rd Qu.:1361  
 Max.   :2021-06-11   Max.   :2521.6   Max.   :2527.0   Max.   :2498.3   Max.   :2525  
     volume           adjClose         adjHigh           adjLow          adjOpen    
 Min.   : 346753   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671  
 1st Qu.:1173522   1st Qu.: 960.8   1st Qu.: 968.8   1st Qu.: 952.2   1st Qu.: 959  
 Median :1412588   Median :1132.5   Median :1143.9   Median :1117.9   Median :1131  
 Mean   :1601590   Mean   :1216.3   Mean   :1227.4   Mean   :1204.2   Mean   :1215  
 3rd Qu.:1812156   3rd Qu.:1360.6   3rd Qu.:1374.3   3rd Qu.:1348.6   3rd Qu.:1361  
 Max.   :6207027   Max.   :2521.6   Max.   :2527.0   Max.   :2498.3   Max.   :2525  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1173522  
 Median :1412588  
 Mean   :1601590  
 3rd Qu.:1812156  
 Max.   :6207027  

Mean of the closing price

The mean closing price is the average price at which the stock closed over a specific period. It can serve as a basic reference point or a simple benchmark for forecasting future stock prices.

mean(dataset$close)
[1] 1216.317

Variance

In the context of closing prices, the variance quantifies the spread or dispersion of the closing prices around their mean. It measures how much the actual closing prices deviate from the average closing price over a specific period.

var(dataset$close)
[1] 146944.5
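Since the variance is expressed in squared price units, the standard deviation (its square root) is often easier to interpret, as it is on the same scale as the price itself. A minimal illustration with invented closing prices:

```r
# The standard deviation is the square root of the variance and lives on
# the same scale as the price, which makes it easier to interpret.
x <- c(700, 710, 695, 1200, 1150, 980)  # illustrative closing prices
stopifnot(isTRUE(all.equal(sd(x), sqrt(var(x)))))
sd(x)
```

For the actual closing prices above, this gives sqrt(146944.5), about 383.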

Below are summaries, outliers, and boxplots for all numeric attributes.

#statistical measures
#summaries
summary(dataset$close)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  668.3   960.8  1132.5  1216.3  1360.6  2521.6 
summary(dataset$high)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  672.3   968.8  1143.9  1227.4  1374.3  2527.0 
summary(dataset$low)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  663.3   952.2  1117.9  1204.2  1348.6  2498.3 
summary(dataset$open)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    671     959    1131    1215    1361    2525 
summary(dataset$volume)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 346753 1173522 1412588 1601590 1812156 6207027 
summary(dataset$adjClose)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  668.3   960.8  1132.5  1216.3  1360.6  2521.6 
summary(dataset$adjHigh)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  672.3   968.8  1143.9  1227.4  1374.3  2527.0 
summary(dataset$adjLow)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  663.3   952.2  1117.9  1204.2  1348.6  2498.3 
summary(dataset$adjOpen)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    671     959    1131    1215    1361    2525 
summary(dataset$adjVolume)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 346753 1173522 1412588 1601590 1812156 6207027 
#outliers
boxplot.stats(dataset$close)$out
 [1] 2070.07 2062.37 2098.00 2092.91 2083.51 2095.38 2095.89 2104.11 2121.90 2128.31 2117.20
[12] 2101.14 2064.88 2070.86 2095.17 2031.36 2036.86 2081.51 2075.84 2026.71 2049.09 2108.54
[23] 2024.17 2052.70 2055.03 2114.77 2061.92 2066.49 2092.52 2091.08 2036.22 2043.20 2038.59
[34] 2052.96 2045.06 2044.36 2035.55 2055.95 2055.54 2068.63 2137.75 2225.55 2224.75 2249.68
[45] 2265.44 2285.88 2254.79 2267.27 2254.84 2296.66 2297.76 2302.40 2293.63 2293.29 2267.92
[56] 2315.30 2326.74 2307.12 2379.91 2429.89 2410.12 2395.17 2354.25 2356.74 2381.35 2398.69
[67] 2341.66 2308.76 2239.08 2261.97 2316.16 2321.41 2303.43 2308.71 2356.09 2345.10 2406.67
[78] 2409.07 2433.53 2402.51 2411.56 2429.81 2421.28 2404.61 2451.76 2466.09 2482.85 2491.40
[89] 2521.60 2513.93
boxplot.stats(dataset$high)$out
 [1] 2116.500 2078.550 2102.510 2123.547 2105.130 2108.370 2102.030 2108.820 2152.680 2133.660
[11] 2132.735 2130.530 2091.420 2082.010 2100.780 2094.880 2071.010 2086.520 2104.370 2088.518
[21] 2089.240 2118.110 2128.810 2078.040 2075.000 2125.700 2090.260 2067.060 2123.560 2109.780
[31] 2075.500 2053.100 2057.990 2072.302 2078.210 2058.870 2050.990 2058.430 2070.780 2093.327
[41] 2142.940 2237.310 2237.660 2255.000 2284.005 2289.040 2275.320 2277.210 2277.990 2306.597
[51] 2306.440 2318.450 2309.600 2295.320 2303.762 2325.820 2341.260 2337.450 2452.378 2436.520
[61] 2427.140 2419.700 2379.260 2382.200 2382.710 2416.410 2378.000 2322.000 2285.370 2276.601
[71] 2321.140 2323.340 2343.150 2316.760 2360.340 2369.000 2418.480 2432.890 2442.944 2440.000
[81] 2428.140 2437.971 2442.000 2409.745 2453.859 2468.000 2494.495 2505.000 2523.260 2526.990
boxplot.stats(dataset$low)$out
 [1] 2018.380 2042.590 2059.330 2072.000 2078.540 2063.090 2077.320 2083.130 2104.360 2098.920
[11] 2103.710 2097.410 2062.140 2002.020 2038.130 2021.290 2016.060 2046.100 2071.260 2010.000
[21] 2020.270 2046.415 2021.610 2047.830 2033.370 2072.380 2047.550 2043.510 2070.000 2054.000
[31] 2033.550 2017.680 2026.070 2039.220 2041.555 2010.730 2014.020 2015.620 2044.030 2056.745
[41] 2096.890 2151.620 2214.800 2225.330 2257.680 2253.714 2238.465 2256.090 2249.190 2266.000
[51] 2284.450 2287.845 2271.710 2258.570 2256.450 2278.210 2313.840 2304.270 2374.850 2402.280
[61] 2402.160 2384.500 2311.700 2351.410 2342.338 2390.000 2334.730 2283.000 2230.050 2242.720
[71] 2283.320 2295.000 2303.160 2263.520 2321.090 2342.370 2360.110 2402.990 2412.515 2402.000
[81] 2407.690 2404.880 2404.200 2382.830 2417.770 2441.073 2468.240 2487.330 2494.000 2498.290
boxplot.stats(dataset$open)$out
 [1] 2073.000 2068.890 2070.000 2105.910 2078.540 2094.210 2099.510 2090.250 2104.360 2100.000
[11] 2110.390 2119.270 2067.000 2025.010 2041.830 2067.450 2050.520 2056.520 2076.190 2067.210
[21] 2023.370 2073.120 2101.130 2070.000 2071.760 2074.060 2085.000 2062.300 2078.990 2076.030
[31] 2061.000 2042.050 2041.840 2051.700 2065.370 2044.810 2038.860 2027.880 2057.630 2059.120
[41] 2097.950 2152.940 2222.500 2226.130 2277.960 2256.700 2266.250 2261.470 2275.160 2276.980
[51] 2303.000 2291.980 2307.890 2285.250 2293.230 2283.470 2319.930 2336.000 2407.145 2410.330
[61] 2404.490 2402.720 2369.740 2368.420 2350.640 2400.000 2374.890 2291.860 2261.710 2261.090
[71] 2291.830 2309.320 2336.906 2264.400 2328.040 2365.990 2367.000 2420.000 2412.835 2436.940
[81] 2421.960 2422.000 2435.310 2395.020 2422.520 2451.320 2479.900 2499.500 2494.010 2524.920
boxplot.stats(dataset$volume)$out
 [1] 3402357 4449022 3530169 3841482 4269902 4745183 3654385 3017947 2973891 2965771 3246573
[12] 3487056 3160585 3270248 3731589 2921393 3248393 4626086 3095263 5125791 3142760 4758496
[23] 3336352 3360727 3267883 3029471 3369275 4760260 3088305 3318204 4405584 2950120 4187586
[34] 3880723 3212657 4595891 3552194 6207027 5130576 2833483 4805752 3316905 3055216 3932954
[45] 2867053 2978300 3790618 3365365 4226748 3700125 4252365 3861489 4233435 3651106 3601750
[56] 4044137 3344450 4081528 3573755 3208495 2951309 3793630 3157875 4267698 3429036 3581072
[67] 3107763 3103882 2888827 4330862 3570927 4016353 4118170 2986439
boxplot.stats(dataset$adjClose)$out
 [1] 2070.07 2062.37 2098.00 2092.91 2083.51 2095.38 2095.89 2104.11 2121.90 2128.31 2117.20
[12] 2101.14 2064.88 2070.86 2095.17 2031.36 2036.86 2081.51 2075.84 2026.71 2049.09 2108.54
[23] 2024.17 2052.70 2055.03 2114.77 2061.92 2066.49 2092.52 2091.08 2036.22 2043.20 2038.59
[34] 2052.96 2045.06 2044.36 2035.55 2055.95 2055.54 2068.63 2137.75 2225.55 2224.75 2249.68
[45] 2265.44 2285.88 2254.79 2267.27 2254.84 2296.66 2297.76 2302.40 2293.63 2293.29 2267.92
[56] 2315.30 2326.74 2307.12 2379.91 2429.89 2410.12 2395.17 2354.25 2356.74 2381.35 2398.69
[67] 2341.66 2308.76 2239.08 2261.97 2316.16 2321.41 2303.43 2308.71 2356.09 2345.10 2406.67
[78] 2409.07 2433.53 2402.51 2411.56 2429.81 2421.28 2404.61 2451.76 2466.09 2482.85 2491.40
[89] 2521.60 2513.93
boxplot.stats(dataset$adjHigh)$out
 [1] 2116.500 2078.550 2102.510 2123.547 2105.130 2108.370 2102.030 2108.820 2152.680 2133.660
[11] 2132.735 2130.530 2091.420 2082.010 2100.780 2094.880 2071.010 2086.520 2104.370 2088.518
[21] 2089.240 2118.110 2128.810 2078.040 2075.000 2125.700 2090.260 2067.060 2123.560 2109.780
[31] 2075.500 2053.100 2057.990 2072.302 2078.210 2058.870 2050.990 2058.430 2070.780 2093.327
[41] 2142.940 2237.310 2237.660 2255.000 2284.005 2289.040 2275.320 2277.210 2277.990 2306.597
[51] 2306.440 2318.450 2309.600 2295.320 2303.762 2325.820 2341.260 2337.450 2452.378 2436.520
[61] 2427.140 2419.700 2379.260 2382.200 2382.710 2416.410 2378.000 2322.000 2285.370 2276.601
[71] 2321.140 2323.340 2343.150 2316.760 2360.340 2369.000 2418.480 2432.890 2442.944 2440.000
[81] 2428.140 2437.971 2442.000 2409.745 2453.859 2468.000 2494.495 2505.000 2523.260 2526.990
boxplot.stats(dataset$adjLow)$out
 [1] 2018.380 2042.590 2059.330 2072.000 2078.540 2063.090 2077.320 2083.130 2104.360 2098.920
[11] 2103.710 2097.410 2062.140 2002.020 2038.130 2021.290 2016.060 2046.100 2071.260 2010.000
[21] 2020.270 2046.415 2021.610 2047.830 2033.370 2072.380 2047.550 2043.510 2070.000 2054.000
[31] 2033.550 2017.680 2026.070 2039.220 2041.555 2010.730 2014.020 2015.620 2044.030 2056.745
[41] 2096.890 2151.620 2214.800 2225.330 2257.680 2253.714 2238.465 2256.090 2249.190 2266.000
[51] 2284.450 2287.845 2271.710 2258.570 2256.450 2278.210 2313.840 2304.270 2374.850 2402.280
[61] 2402.160 2384.500 2311.700 2351.410 2342.338 2390.000 2334.730 2283.000 2230.050 2242.720
[71] 2283.320 2295.000 2303.160 2263.520 2321.090 2342.370 2360.110 2402.990 2412.515 2402.000
[81] 2407.690 2404.880 2404.200 2382.830 2417.770 2441.073 2468.240 2487.330 2494.000 2498.290
boxplot.stats(dataset$adjOpen)$out
 [1] 2073.000 2068.890 2070.000 2105.910 2078.540 2094.210 2099.510 2090.250 2104.360 2100.000
[11] 2110.390 2119.270 2067.000 2025.010 2041.830 2067.450 2050.520 2056.520 2076.190 2067.210
[21] 2023.370 2073.120 2101.130 2070.000 2071.760 2074.060 2085.000 2062.300 2078.990 2076.030
[31] 2061.000 2042.050 2041.840 2051.700 2065.370 2044.810 2038.860 2027.880 2057.630 2059.120
[41] 2097.950 2152.940 2222.500 2226.130 2277.960 2256.700 2266.250 2261.470 2275.160 2276.980
[51] 2303.000 2291.980 2307.890 2285.250 2293.230 2283.470 2319.930 2336.000 2407.145 2410.330
[61] 2404.490 2402.720 2369.740 2368.420 2350.640 2400.000 2374.890 2291.860 2261.710 2261.090
[71] 2291.830 2309.320 2336.906 2264.400 2328.040 2365.990 2367.000 2420.000 2412.835 2436.940
[81] 2421.960 2422.000 2435.310 2395.020 2422.520 2451.320 2479.900 2499.500 2494.010 2524.920
boxplot.stats(dataset$adjVolume)$out
 [1] 3402357 4449022 3530169 3841482 4269902 4745183 3654385 3017947 2973891 2965771 3246573
[12] 3487056 3160585 3270248 3731589 2921393 3248393 4626086 3095263 5125791 3142760 4758496
[23] 3336352 3360727 3267883 3029471 3369275 4760260 3088305 3318204 4405584 2950120 4187586
[34] 3880723 3212657 4595891 3552194 6207027 5130576 2833483 4805752 3316905 3055216 3932954
[45] 2867053 2978300 3790618 3365365 4226748 3700125 4252365 3861489 4233435 3651106 3601750
[56] 4044137 3344450 4081528 3573755 3208495 2951309 3793630 3157875 4267698 3429036 3581072
[67] 3107763 3103882 2888827 4330862 3570927 4016353 4118170 2986439
#boxplots
boxplot(dataset$close)

boxplot(dataset$high)

boxplot(dataset$low)

boxplot(dataset$open)

boxplot(dataset$volume)

boxplot(dataset$adjClose)

boxplot(dataset$adjHigh)

boxplot(dataset$adjLow)

boxplot(dataset$adjOpen)

boxplot(dataset$adjVolume)

This scatter plot helps us determine whether the closing price and volume are correlated. The points are widely scattered, suggesting only a weak relationship between the two attributes, consistent with their low correlation coefficient (about 0.12) in the correlation matrix computed later.

with(dataset, plot(volume, close))

The bar plot shows the closing price for each date in the dataset, indicating how the closing price rises and falls over time.

barplot(height = dataset$close, names.arg = dataset$date, xlab = "Date", ylab = "Closing price", main = "date vs Close")
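Because the data form a daily time series, a line plot often shows the trend more clearly than one bar per day. A sketch with a few invented rows standing in for the dataset:

```r
# A line plot of closing price over time; the rows here are illustrative
# stand-ins for the real dataset's date and close columns.
toy <- data.frame(
  date  = as.Date("2016-06-14") + 0:4,
  close = c(718, 719, 710, 692, 694)
)
plot(close ~ date, data = toy, type = "l",
     xlab = "Date", ylab = "Closing price", main = "Closing price over time")
```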

This histogram shows the frequency distribution of the stock's closing price in the dataset. Most values lie between 1000 and 1200.

hist(dataset$close)

# 4. Data Preprocessing

Here is our data set before preprocessing

#dataset before preprocessing
print(dataset)

Data cleaning, including handling missing values such as NULLs, is crucial before using data for analysis or modeling. Missing or incorrect data can skew the analysis and lead to inaccurate insights or predictions, while clean data makes the findings more reliable and reduces the risk of decisions based on flawed information.

To find the null values in the dataset we check every cell: FALSE means the value is not null, TRUE means it is.

is.na(dataset)
         date close  high   low  open volume adjClose adjHigh adjLow adjOpen adjVolume
   [1,] FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE   FALSE  FALSE   FALSE     FALSE
   [2,] FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE   FALSE  FALSE   FALSE     FALSE
   [3,] FALSE FALSE FALSE FALSE FALSE  FALSE    FALSE   FALSE  FALSE   FALSE     FALSE
 [ all 1258 rows are FALSE -- remaining output omitted ]
sum(is.na(dataset))
[1] 0
print("Since there are no NULL values we don't need to remove any rows")
[1] "Since there are no NULL values we don't need to remove any rows"

Since there are no null values in our data, we do not need to remove any rows.

Most attributes in our dataset are numeric, and removing outliers from all of them would discard too many rows and distort our calculations and predictions, so we remove outliers from the closing price and volume only.

#dataset before removing outliers
print(dataset)
summary(dataset)
      date                close             high             low              open     
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671  
 1st Qu.:2017-09-12   1st Qu.: 960.8   1st Qu.: 968.8   1st Qu.: 952.2   1st Qu.: 959  
 Median :2018-12-11   Median :1132.5   Median :1143.9   Median :1117.9   Median :1131  
 Mean   :2018-12-12   Mean   :1216.3   Mean   :1227.4   Mean   :1204.2   Mean   :1215  
 3rd Qu.:2020-03-12   3rd Qu.:1360.6   3rd Qu.:1374.3   3rd Qu.:1348.6   3rd Qu.:1361  
 Max.   :2021-06-11   Max.   :2521.6   Max.   :2527.0   Max.   :2498.3   Max.   :2525  
     volume           adjClose         adjHigh           adjLow          adjOpen    
 Min.   : 346753   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671  
 1st Qu.:1173522   1st Qu.: 960.8   1st Qu.: 968.8   1st Qu.: 952.2   1st Qu.: 959  
 Median :1412588   Median :1132.5   Median :1143.9   Median :1117.9   Median :1131  
 Mean   :1601590   Mean   :1216.3   Mean   :1227.4   Mean   :1204.2   Mean   :1215  
 3rd Qu.:1812156   3rd Qu.:1360.6   3rd Qu.:1374.3   3rd Qu.:1348.6   3rd Qu.:1361  
 Max.   :6207027   Max.   :2521.6   Max.   :2527.0   Max.   :2498.3   Max.   :2525  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1173522  
 Median :1412588  
 Mean   :1601590  
 3rd Qu.:1812156  
 Max.   :6207027  
str(dataset)
'data.frame':   1258 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : num  718 719 710 692 694 ...
 $ high     : num  722 723 717 709 702 ...
 $ low      : num  713 717 703 688 693 ...
 $ open     : num  716 719 715 709 699 ...
 $ volume   : int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
 $ adjClose : num  718 719 710 692 694 ...
 $ adjHigh  : num  722 723 717 709 702 ...
 $ adjLow   : num  713 717 703 688 693 ...
 $ adjOpen  : num  716 719 715 709 699 ...
 $ adjVolume: int  1306065 1214517 1982471 3402357 2082538 1465634 1184318 2171415 4449022 2641085 ...
#removing close outlier
outliers <- boxplot(dataset$close, plot=FALSE)$out
dataset <- dataset[-which(dataset$close %in% outliers),]
boxplot.stats(dataset$close)$out
 [1] 1749.13 1763.37 1761.75 1763.00 1752.71 1749.84 1777.02 1781.38 1770.15 1746.78 1763.92
[12] 1768.88 1771.43 1793.19 1760.74 1798.10 1827.95 1826.77 1827.99 1819.48 1818.55 1784.13
[23] 1775.33 1781.77 1760.06 1767.77 1763.00 1747.90 1776.09 1758.72 1751.88 1787.25 1807.21
[34] 1766.72 1746.55 1754.40 1790.86 1886.90 1891.25 1901.05 1899.40 1917.24 1830.79 1863.11
[45] 1835.74 1901.35 1927.51
#removing volume's outlier
outliers <- boxplot(dataset$volume, plot=FALSE)$out
dataset <- dataset[-which(dataset$volume %in% outliers),]
boxplot.stats(dataset$volume)$out
 [1] 2641085 2700470 2749221 2607121 2553771 2712222 2634669 2720942 2560277 2580374 2558385
[12] 2726830 2680400 2619234 2675742 2580612 2769225 2673464 2576470 2642983 2597455 2561288
[23] 2660628 2611373 2611229 2574061 2664723 2668906 2608568 2610884 2568345 2636142 2602114
[34] 2748292
#data set after removing outliers
print(dataset)
summary(dataset)
      date                close             high             low              open       
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:2017-08-10   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :2018-09-29   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :2018-10-03   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:2019-11-14   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :2021-02-02   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
     volume           adjClose         adjHigh           adjLow          adjOpen      
 Min.   : 346753   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:1167344   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :1394116   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :1480717   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:1719968   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :2769225   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : num  718 719 710 694 696 ...
 $ high     : num  722 723 717 702 703 ...
 $ low      : num  713 717 703 693 692 ...
 $ open     : num  716 719 715 699 698 ...
 $ volume   : int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Feature selection

Remove Redundant Features

# load the library        
library(mlbench)
library(caret)
library(ggplot2)
library(lattice)

# calculate correlation matrix
correlationMatrix <- cor(dataset[,2:9])

# summarize the correlation matrix
print(correlationMatrix)
             close      high       low      open    volume  adjClose   adjHigh    adjLow
close    1.0000000 0.9993759 0.9994124 0.9986066 0.1155092 1.0000000 0.9993759 0.9994124
high     0.9993759 1.0000000 0.9992333 0.9993994 0.1278230 0.9993759 1.0000000 0.9992333
low      0.9994124 0.9992333 1.0000000 0.9993082 0.1038372 0.9994124 0.9992333 1.0000000
open     0.9986066 0.9993994 0.9993082 1.0000000 0.1177215 0.9986066 0.9993994 0.9993082
volume   0.1155092 0.1278230 0.1038372 0.1177215 1.0000000 0.1155092 0.1278230 0.1038372
adjClose 1.0000000 0.9993759 0.9994124 0.9986066 0.1155092 1.0000000 0.9993759 0.9994124
adjHigh  0.9993759 1.0000000 0.9992333 0.9993994 0.1278230 0.9993759 1.0000000 0.9992333
adjLow   0.9994124 0.9992333 1.0000000 0.9993082 0.1038372 0.9994124 0.9992333 1.0000000
# find attributes that are highly correlated (cutoff = 0.5)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes
print(highlyCorrelated)
[1] 2 7 4 1 6 8

dataset before normalization

#dataset before normalization 
print(dataset)
summary(dataset)
      date                close             high             low              open       
 Min.   :2016-06-14   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:2017-08-10   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :2018-09-29   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :2018-10-03   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:2019-11-14   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :2021-02-02   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
     volume           adjClose         adjHigh           adjLow          adjOpen      
 Min.   : 346753   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:1167344   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :1394116   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :1480717   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:1719968   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :2769225   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : num  718 719 710 694 696 ...
 $ high     : num  722 723 717 702 703 ...
 $ low      : num  713 717 703 693 692 ...
 $ open     : num  716 719 715 699 698 ...
 $ volume   : int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Normalization was performed to ensure consistent scaling of the data. The technique applied was min-max normalization, which rescales the values of the selected attributes into the range between 0 and 1.

The normalized dataset provides a more uniform and comparable representation of the attributes, enabling more accurate analysis and modeling for stock prediction, with the result shown below.

# min-max normalization: rescale x into [0, 1]
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataWithoutNormalization <- dataset
dataset$close<-normalize(dataWithoutNormalization$close)
dataset$volume<-normalize(dataWithoutNormalization$volume)
dataset$open<-normalize(dataWithoutNormalization$open)
dataset$low <-normalize(dataWithoutNormalization$low)
dataset$high <-normalize(dataWithoutNormalization$high)

dataset after normalization

#dataset after normalization 
print(dataset)
summary(dataset)
      date                close             high             low              open       
 Min.   :2016-06-14   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   1st Qu.:0.2175   1st Qu.:0.2115   1st Qu.:0.2162   1st Qu.:0.2147  
 Median :2018-09-29   Median :0.3553   Median :0.3532   Median :0.3524   Median :0.3554  
 Mean   :2018-10-03   Mean   :0.3746   Mean   :0.3718   Mean   :0.3723   Mean   :0.3736  
 3rd Qu.:2019-11-14   3rd Qu.:0.4736   3rd Qu.:0.4701   3rd Qu.:0.4698   3rd Qu.:0.4722  
 Max.   :2021-02-02   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     volume          adjClose         adjHigh           adjLow          adjOpen      
 Min.   :0.0000   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:0.3387   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :0.4324   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :0.4681   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:0.5669   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :1.0000   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : num  0.0397 0.0402 0.0334 0.0202 0.022 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

dataset before Discretization

#dataset before Discretization 
print(dataset)
summary(dataset)
      date                close             high             low              open       
 Min.   :2016-06-14   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   1st Qu.:0.2175   1st Qu.:0.2115   1st Qu.:0.2162   1st Qu.:0.2147  
 Median :2018-09-29   Median :0.3553   Median :0.3532   Median :0.3524   Median :0.3554  
 Mean   :2018-10-03   Mean   :0.3746   Mean   :0.3718   Mean   :0.3723   Mean   :0.3736  
 3rd Qu.:2019-11-14   3rd Qu.:0.4736   3rd Qu.:0.4701   3rd Qu.:0.4698   3rd Qu.:0.4722  
 Max.   :2021-02-02   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     volume          adjClose         adjHigh           adjLow          adjOpen      
 Min.   :0.0000   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:0.3387   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :0.4324   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :0.4681   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:0.5669   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :1.0000   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : num  0.0397 0.0402 0.0334 0.0202 0.022 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

We used the discretization technique on our class label “close” to simplify it, since it takes a large range of continuous values; we grouped the values into intervals to make them easier to analyze.

We chose 0.2957251 as the threshold value for the closing price.

dataset$close <- ifelse(dataset$close <= 0.2957251 , "low","High")
print(dataset)

We discretized the close into two categories (low, high) based on this threshold: low meaning the close is less than or equal to the threshold, and high meaning it is greater than the threshold.

Encoding: we encoded the close labels as factors, which helps the model handle this data easily.


dataset$close <- factor(dataset$close,levels = c("low", "High"), labels = c("1", "2"))

print(dataset)

dataset after Discretization

#dataset after Discretization 
print(dataset)
summary(dataset)
      date            close        high             low              open       
 Min.   :2016-06-14   1:396   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   2:700   1st Qu.:0.2115   1st Qu.:0.2162   1st Qu.:0.2147  
 Median :2018-09-29           Median :0.3532   Median :0.3524   Median :0.3554  
 Mean   :2018-10-03           Mean   :0.3718   Mean   :0.3723   Mean   :0.3736  
 3rd Qu.:2019-11-14           3rd Qu.:0.4701   3rd Qu.:0.4698   3rd Qu.:0.4722  
 Max.   :2021-02-02           Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     volume          adjClose         adjHigh           adjLow          adjOpen      
 Min.   :0.0000   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:0.3387   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :0.4324   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :0.4681   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:0.5669   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :1.0000   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Summary after preprocessing: several steps were taken to refine, clean, and prepare the data for analysis and modeling. These preprocessing steps aim to enhance the quality and reliability of the data for more accurate stock price prediction.

dataset after preprocessing

#dataset after preprocessing 
print(dataset)
summary(dataset)
      date            close        high             low              open       
 Min.   :2016-06-14   1:396   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2017-08-10   2:700   1st Qu.:0.2115   1st Qu.:0.2162   1st Qu.:0.2147  
 Median :2018-09-29           Median :0.3532   Median :0.3524   Median :0.3554  
 Mean   :2018-10-03           Mean   :0.3718   Mean   :0.3723   Mean   :0.3736  
 3rd Qu.:2019-11-14           3rd Qu.:0.4701   3rd Qu.:0.4698   3rd Qu.:0.4722  
 Max.   :2021-02-02           Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
     volume          adjClose         adjHigh           adjLow          adjOpen      
 Min.   :0.0000   Min.   : 668.3   Min.   : 672.3   Min.   : 663.3   Min.   : 671.0  
 1st Qu.:0.3387   1st Qu.: 942.2   1st Qu.: 943.8   1st Qu.: 933.8   1st Qu.: 939.7  
 Median :0.4324   Median :1115.7   Median :1125.6   Median :1104.2   Median :1115.8  
 Mean   :0.4681   Mean   :1139.9   Mean   :1149.5   Mean   :1129.1   Mean   :1138.6  
 3rd Qu.:0.5669   3rd Qu.:1264.7   3rd Qu.:1275.7   3rd Qu.:1251.1   3rd Qu.:1262.0  
 Max.   :1.0000   Max.   :1927.5   Max.   :1955.8   Max.   :1914.5   Max.   :1922.6  
   adjVolume      
 Min.   : 346753  
 1st Qu.:1167344  
 Median :1394116  
 Mean   :1480717  
 3rd Qu.:1719968  
 Max.   :2769225  
str(dataset)
'data.frame':   1096 obs. of  11 variables:
 $ date     : Date, format: "2016-06-14" "2016-06-15" "2016-06-16" ...
 $ close    : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
 $ high     : num  0.0391 0.0395 0.0346 0.0235 0.0237 ...
 $ low      : num  0.0398 0.0432 0.0319 0.0241 0.023 ...
 $ open     : num  0.0363 0.0384 0.0351 0.0222 0.0219 ...
 $ volume   : num  0.396 0.358 0.675 0.717 0.462 ...
 $ adjClose : num  718 719 710 694 696 ...
 $ adjHigh  : num  722 723 717 702 703 ...
 $ adjLow   : num  713 717 703 693 692 ...
 $ adjOpen  : num  716 719 715 699 698 ...
 $ adjVolume: int  1306065 1214517 1982471 2082538 1465634 1184318 2171415 2641085 2173762 1932561 ...

Feature selection

Feature selection is the process of selecting a subset of relevant features (or attributes) from the original set of features in a dataset. The goal is to choose the most relevant and important features, thereby reducing dimensionality and improving model performance.

#Feature selection ,Feature selection using Recursive Feature Elimination or RFE

library(mlbench)
library(caret)


# define the control using a random forest selection function;
# number=11 specifies 11-fold cross-validation
control <- rfeControl(functions=rfFuncs, method="cv", number=11)
# run the RFE algorithm: columns 1-10 as predictors, column 11 as the target
results <- rfe(dataset[,1:10], dataset[,11], sizes=c(1:10), rfeControl=control)

summarize the results

print(results)

Recursive feature selection

Outer resampling method: Cross-Validated (11 fold) 

Resampling performance over subset size:

The top 1 variables (out of 1):
   open

List the chosen features. The result shows that the most important attribute is open, so we will use it for prediction in the classification step.

predictors(results)
[1] "open"

plot the results

plot(results, type=c("h", "o"))

5. Data Mining Techniques

We applied both supervised and unsupervised learning techniques to our dataset (Google stock prediction), involving classification and clustering methods. For classification, we used a partitioning method called the train-test split, which divides the dataset into two subsets at different ratios, and we implemented three algorithms to form 9 different decision trees.

6. Evaluation and Comparison

We will choose the attributes with the highest importance (from feature selection) to create a tree:

  1. Dividing the dataset:

we divided our dataset into two subsets for each split:

the first split is 70-30, which means Training (70%) and Testing (30%):

# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(70%) and Testing(30%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c( 0.70, 0.30))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
  2. Determine the predictor attributes and the class label attribute (the formula):
library(party) 
Loading required package: grid
Loading required package: mvtnorm
Loading required package: modeltools
Loading required package: stats4
Loading required package: strucchange
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: sandwich
library(grid) 
#myFormula 
myFormula <- close ~ volume + open + high + low
  3. Build a decision tree using Information gain:

Information gain is a concept used in the field of machine learning and decision tree algorithms. It is a measure of the effectiveness of a particular attribute in classifying data. In the context of decision trees, information gain helps determine the order in which attributes are chosen for splitting the data.
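As a concrete illustration (a minimal Python sketch with our own function names, not part of the original R analysis), information gain for a binary split is the parent node's entropy minus the weighted entropy of the two child nodes:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy reduction achieved by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# A split that separates the two classes perfectly recovers all of the
# parent's entropy: 1.0 bit for a balanced two-class node.
parent = ["low", "low", "High", "High"]
gain = information_gain(parent, ["low", "low"], ["High", "High"])
```

At each node, a tree grown with this criterion evaluates candidate thresholds on each attribute and keeps the split with the highest gain.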

dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
   
      1   2
  1 284  11
  2   0 476
# 4.Print and plot the tree:

print(dataset_ctree)

     Conditional inference tree with 4 terminal nodes

Response:  close 
Inputs:  volume, open, high, low 
Number of observations:  771 

1) open <= 0.2974608; criterion = 1, statistic = 423.273
  2) high <= 0.2892353; criterion = 1, statistic = 19.817
    3)*  weights = 267 
  2) high > 0.2892353
    4)*  weights = 17 
1) open > 0.2974608
  5) low <= 0.2955676; criterion = 0.995, statistic = 10.36
    6)*  weights = 11 
  5) low > 0.2955676
    7)*  weights = 476 
plot(dataset_ctree, type="simple")

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 111   3
       2   1 210
# Evaluate the model and create confusion matrix
install.packages("caret")
Error in install.packages : Updating loaded packages
install.packages('e1071', dependencies=TRUE)
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/wijda/AppData/Local/R/win-library/4.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/e1071_1.7-13.zip'
Content type 'application/zip' length 653517 bytes (638 KB)
downloaded 638 KB
package ‘e1071’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\wijda\AppData\Local\Temp\RtmpMlBbEF\downloaded_packages
library(e1071)
Warning: package ‘e1071’ was built under R version 4.3.2
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 111   3
       2   1 210
                                          
               Accuracy : 0.9877          
                 95% CI : (0.9688, 0.9966)
    No Information Rate : 0.6554          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9729          
                                          
 Mcnemar's Test P-Value : 0.6171          
                                          
            Sensitivity : 0.9911          
            Specificity : 0.9859          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 0.9953          
             Prevalence : 0.3446          
         Detection Rate : 0.3415          
   Detection Prevalence : 0.3508          
      Balanced Accuracy : 0.9885          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9910714
specificity(as.table(co_result))
[1] 0.9859155
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9876923 
  6. Building the Tree using Gini Index (CART)

The Gini Index is another criterion used in decision tree algorithms, particularly in the context of the Classification and Regression Trees (CART) algorithm. Like information gain, the Gini Index is used to evaluate the impurity or homogeneity of a dataset.

The Gini Index for a specific attribute measures the probability of incorrectly classifying a randomly chosen element in the dataset. A lower Gini Index indicates a purer or more homogeneous set. In the context of decision trees, the attribute with the lowest Gini Index is chosen as the split attribute.
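For intuition, here is a minimal Python sketch (an illustration with our own function names, not part of the original R analysis) of the Gini impurity and the weighted impurity of a binary split:

```python
def gini(labels):
    """Gini impurity: probability of misclassifying a randomly drawn element
    when labels are assigned according to the class distribution."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(left, right):
    """Weighted Gini impurity of a binary split; CART picks the split
    that minimizes this value."""
    n = len(left) + len(right)
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

# A pure node has impurity 0.0, a balanced two-class node has 0.5,
# and a perfectly separating split drives the weighted impurity to 0.0.
```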

# For decision tree model
install.packages("rpart")
Error in install.packages : Updating loaded packages
library(rpart)
# For data visualization
library(rpart.plot)
Warning: package ‘rpart.plot’ was built under R version 4.3.2
dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))

Visualizing the unpruned tree

library(rpart.plot)
rpart.plot(dataset.cart)

Checking the order of variable importance

dataset.cart$variable.importance
       low       high       open     volume 
343.117705 330.102896 326.553402   4.732658 
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
         
pred.tree   1   2
        1 111   3
        2   1 210
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 111   3
       2   1 210
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 111   3
       2   1 210
                                          
               Accuracy : 0.9877          
                 95% CI : (0.9688, 0.9966)
    No Information Rate : 0.6554          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9729          
                                          
 Mcnemar's Test P-Value : 0.6171          
                                          
            Sensitivity : 0.9911          
            Specificity : 0.9859          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 0.9953          
             Prevalence : 0.3446          
         Detection Rate : 0.3415          
   Detection Prevalence : 0.3508          
      Balanced Accuracy : 0.9885          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9910714
specificity(as.table(co_result))
[1] 0.9859155
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9876923 
  7. Building the Tree using Gain ratio (C5.0)

The Gain Ratio is used to select the attribute that maximizes the Information Gain while avoiding the bias towards attributes with many values. It provides a more balanced measure for attribute selection in decision tree construction.

While Information Gain simply measures the reduction in entropy or uncertainty, Gain Ratio takes into account the intrinsic information of an attribute. It aims to penalize attributes that may have a large number of values, potentially leading to overfitting.
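The relationship can be sketched in a few lines of Python (an illustration with assumed function names, not part of the original R analysis): gain ratio divides the information gain by the split information, i.e. the entropy of the partition sizes themselves:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def gain_ratio(parent, partitions):
    """Information gain normalized by split information; attributes that
    shatter the data into many tiny branches are penalized."""
    n = len(parent)
    weighted = sum((len(p) / n) * entropy(p) for p in partitions)
    gain = entropy(parent) - weighted
    split_info = -sum((len(p) / n) * math.log2(len(p) / n) for p in partitions)
    return gain / split_info if split_info > 0 else 0.0

# A balanced, perfectly separating binary split has gain 1.0 and split
# information 1.0, so its gain ratio is 1.0.
```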

install.packages("C50")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/wijda/AppData/Local/R/win-library/4.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/C50_0.1.8.zip'
Content type 'application/zip' length 342652 bytes (334 KB)
downloaded 334 KB
package ‘C50’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\wijda\AppData\Local\Temp\RtmpMlBbEF\downloaded_packages
install.packages("printr")
WARNING: Rtools is required to build R packages but is not currently installed. Please download and install the appropriate version of Rtools before proceeding:

https://cran.rstudio.com/bin/windows/Rtools/
Installing package into ‘C:/Users/wijda/AppData/Local/R/win-library/4.3’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/printr_0.3.zip'
Content type 'application/zip' length 39413 bytes (38 KB)
downloaded 38 KB
package ‘printr’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\wijda\AppData\Local\Temp\RtmpMlBbEF\downloaded_packages
library(C50)
Warning: package ‘C50’ was built under R version 4.3.2
library(printr)
Warning: package ‘printr’ was built under R version 4.3.2
Registered S3 method overwritten by 'printr':
  method                from     
  knit_print.data.frame rmarkdown
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)

Call:
C5.0.formula(formula = myFormula, data = trainData)


C5.0 [Release 2.07 GPL Edition]     Sat Dec  2 21:10:29 2023
-------------------------------

Class specified by attribute `outcome'

Read 771 cases (5 attributes) from undefined.data

Decision tree:

low > 0.2960392: 2 (481/1)
low <= 0.2960392:
:...high <= 0.2892354: 1 (266)
    high > 0.2892354:
    :...high > 0.3075281: 2 (2)
        high <= 0.3075281:
        :...open <= 0.278852: 2 (2)
            open > 0.278852: 1 (20/3)


Evaluation on training data (771 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         5    4( 0.5%)   <<


       (a)   (b)    <-classified as
      ----  ----
       283     1    (a): class 1
         3   484    (b): class 2


    Attribute usage:

    100.00% low
     37.61% high
      2.85% open


Time: 0.0 secs
plot(CloseTree)

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(CloseTree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 110   1
       2   2 212
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 110   1
       2   2 212
                                          
               Accuracy : 0.9908          
                 95% CI : (0.9733, 0.9981)
    No Information Rate : 0.6554          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9795          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9821          
            Specificity : 0.9953          
         Pos Pred Value : 0.9910          
         Neg Pred Value : 0.9907          
             Prevalence : 0.3446          
         Detection Rate : 0.3385          
   Detection Prevalence : 0.3415          
      Balanced Accuracy : 0.9887          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9821429
specificity(as.table(co_result))
[1] 0.9953052
precision(as.table(co_result))
[1] 0.990991
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9907692 

the second split is 60-40, which means Training (60%) and Testing (40%):

# a fixed random seed to make results reproducible
set.seed(1234)

# 1.Split the datasets into two subsets: Training(60%) and Testing(40%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.60 , 0.40))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
  2. Determine the predictor attributes and the class label attribute (the formula):
library(party) 
library(grid)
#myFormula 
myFormula <- close ~ volume + open + high + low
  3. Build a decision tree using the training set and check the prediction:
dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
   
      1   2
  1 249   8
  2   0 398
# 4.Print and plot the tree:

print(dataset_ctree)

     Conditional inference tree with 4 terminal nodes

Response:  close 
Inputs:  volume, open, high, low 
Number of observations:  655 

1) open <= 0.2974608; criterion = 1, statistic = 363.998
  2) high <= 0.2892353; criterion = 0.998, statistic = 11.719
    3)*  weights = 235 
  2) high > 0.2892353
    4)*  weights = 12 
1) open > 0.2974608
  5) low <= 0.2955676; criterion = 0.987, statistic = 8.71
    6)*  weights = 10 
  5) low > 0.2955676
    7)*  weights = 398 
plot(dataset_ctree, type="simple")

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 146   6
       2   1 288
# Evaluate the model and create confusion matrix
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 146   6
       2   1 288
                                          
               Accuracy : 0.9841          
                 95% CI : (0.9676, 0.9936)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9646          
                                          
 Mcnemar's Test P-Value : 0.1306          
                                          
            Sensitivity : 0.9932          
            Specificity : 0.9796          
         Pos Pred Value : 0.9605          
         Neg Pred Value : 0.9965          
             Prevalence : 0.3333          
         Detection Rate : 0.3311          
   Detection Prevalence : 0.3447          
      Balanced Accuracy : 0.9864          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9795918
precision(as.table(co_result))
[1] 0.9605263
acc <- co_result$overall["Accuracy"]
acc
Accuracy 
0.984127 
  6. Building the Tree using Gini Index (CART)
# For decision tree model
library(rpart)
# For data visualization
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))

Visualizing the unpruned tree

rpart.plot(dataset.cart)

Checking the order of variable importance

dataset.cart$variable.importance
       low       high       open     volume 
294.972422 284.520643 282.198025   4.645235 
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
         
pred.tree   1   2
        1 146   4
        2   1 290
# 5. Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset.cart, newdata = testData, type = "class")
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 146   4
       2   1 290
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 146   4
       2   1 290
                                          
               Accuracy : 0.9887          
                 95% CI : (0.9737, 0.9963)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9746          
                                          
 Mcnemar's Test P-Value : 0.3711          
                                          
            Sensitivity : 0.9932          
            Specificity : 0.9864          
         Pos Pred Value : 0.9733          
         Neg Pred Value : 0.9966          
             Prevalence : 0.3333          
         Detection Rate : 0.3311          
   Detection Prevalence : 0.3401          
      Balanced Accuracy : 0.9898          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9863946
precision(as.table(co_result))
[1] 0.9733333
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9886621 
Building the Tree using the Gain Ratio (C5.0)

install.packages("caret")
install.packages("C50")
install.packages("printr")
library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)

summary(CloseTree)

Call:
C5.0.formula(formula = myFormula, data = trainData)


C5.0 [Release 2.07 GPL Edition]     Sat Dec  2 21:13:58 2023
-------------------------------

Class specified by attribute `outcome'

Read 655 cases (5 attributes) from undefined.data

Decision tree:

low <= 0.2960392: 1 (254/6)
low > 0.2960392: 2 (401/1)


Evaluation on training data (655 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         2    7( 1.1%)   <<


       (a)   (b)    <-classified as
      ----  ----
       248     1    (a): class 1
         6   400    (b): class 2


    Attribute usage:

    100.00% low


Time: 0.0 secs
plot(CloseTree)

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(CloseTree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1 146   4
       2   1 290
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1 146   4
       2   1 290
                                          
               Accuracy : 0.9887          
                 95% CI : (0.9737, 0.9963)
    No Information Rate : 0.6667          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9746          
                                          
 Mcnemar's Test P-Value : 0.3711          
                                          
            Sensitivity : 0.9932          
            Specificity : 0.9864          
         Pos Pred Value : 0.9733          
         Neg Pred Value : 0.9966          
             Prevalence : 0.3333          
         Detection Rate : 0.3311          
   Detection Prevalence : 0.3401          
      Balanced Accuracy : 0.9898          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9931973
specificity(as.table(co_result))
[1] 0.9863946
precision(as.table(co_result))
[1] 0.9733333
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9886621 

The third split is 80-20, i.e. Training (80%) and Testing (20%):

# a fixed random seed to make results reproducible
set.seed(1234)

# 1. Split the dataset into two subsets: Training (80%) and Testing (20%):
ind1 <- sample(2, nrow(dataset), replace=TRUE, prob=c(0.80 , 0.20))
trainData  <- dataset[ind1==1,]
testData <- dataset[ind1==2,]
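Note that R's sample()-based split assigns each row independently to train or test with probability 0.80/0.20, so the subset sizes vary slightly around 80/20 rather than being an exact cut. A minimal Python sketch of the same idea (using hypothetical row indices, not the actual stock data):

```python
import random

random.seed(1234)

# Hypothetical stand-in for the dataset: a list of row indices.
rows = list(range(1258))  # the dataset has 1258 objects

# Mimic R's sample(2, n, replace=TRUE, prob=c(0.80, 0.20)):
# each row is independently assigned to group 1 (train) or group 2 (test).
ind1 = [1 if random.random() < 0.80 else 2 for _ in rows]

trainData = [r for r, g in zip(rows, ind1) if g == 1]
testData  = [r for r, g in zip(rows, ind1) if g == 2]

print(len(trainData), len(testData))
```

Because the assignment is Bernoulli per row, the printed sizes are only approximately 80% and 20% of 1258.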

2. Determine the predictor attributes and the class label attribute (the formula):

library(party)  
library(grid)
#myFormula 
myFormula <- close ~ volume + open + high + low

3. Build a decision tree using the training set and check the prediction:

dataset_ctree <- ctree(myFormula, data=trainData)
table(predict(dataset_ctree), trainData$close)
   
      1   2
  1 322  14
  2   0 535
# 4. Print and plot the tree:

print(dataset_ctree)

     Conditional inference tree with 4 terminal nodes

Response:  close 
Inputs:  volume, open, high, low 
Number of observations:  871 

1) open <= 0.2974608; criterion = 1, statistic = 478.791
  2) high <= 0.2892353; criterion = 1, statistic = 22.684
    3)*  weights = 303 
  2) high > 0.2892353
    4)*  weights = 19 
1) open > 0.2974608
  5) low <= 0.2997876; criterion = 0.997, statistic = 11.651
    6)*  weights = 14 
  5) low > 0.2997876
    7)*  weights = 535 
plot(dataset_ctree, type="simple")

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset_ctree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1  74   2
       2   0 149
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1  74   2
       2   0 149
                                          
               Accuracy : 0.9911          
                 95% CI : (0.9683, 0.9989)
    No Information Rate : 0.6711          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.98            
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.9868          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 1.0000          
             Prevalence : 0.3289          
         Detection Rate : 0.3289          
   Detection Prevalence : 0.3378          
      Balanced Accuracy : 0.9934          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 1
specificity(as.table(co_result))
[1] 0.986755
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9911111 
Building the Tree using the Gini Index (CART)
# For decision tree model
install.packages("rpart")
library(rpart)
# For data visualization
library(rpart.plot)

dataset.cart <- rpart(myFormula, data = trainData, method = "class", parms = list(split = "gini"))

Visualizing the unpruned tree

library(rpart.plot)
rpart.plot(dataset.cart)

Checking the order of variable importance

dataset.cart$variable.importance
       low       high       open     volume 
386.324609 371.012963 368.657326   4.711276 
pred.tree = predict(dataset.cart, testData, type = "class")

table(pred.tree,testData$close)
         
pred.tree   1   2
        1  74   2
        2   0 149
# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(dataset.cart, newdata = testData, type = "class")
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1  74   2
       2   0 149
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1  74   2
       2   0 149
                                          
               Accuracy : 0.9911          
                 95% CI : (0.9683, 0.9989)
    No Information Rate : 0.6711          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.98            
                                          
 Mcnemar's Test P-Value : 0.4795          
                                          
            Sensitivity : 1.0000          
            Specificity : 0.9868          
         Pos Pred Value : 0.9737          
         Neg Pred Value : 1.0000          
             Prevalence : 0.3289          
         Detection Rate : 0.3289          
   Detection Prevalence : 0.3378          
      Balanced Accuracy : 0.9934          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 1
specificity(as.table(co_result))
[1] 0.986755
precision(as.table(co_result))
[1] 0.9736842
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9911111 
Building the Tree using the Gain Ratio (C5.0)
install.packages("caret")
install.packages("C50")
install.packages("printr")
library(C50)
library(printr)
library(caret)
#train using the trainData and create the c5.0 gain ratio tree
CloseTree <- C5.0(myFormula, data=trainData)
summary(CloseTree)

Call:
C5.0.formula(formula = myFormula, data = trainData)


C5.0 [Release 2.07 GPL Edition]     Sat Dec  2 21:16:23 2023
-------------------------------

Class specified by attribute `outcome'

Read 871 cases (5 attributes) from undefined.data

Decision tree:

low > 0.2960392:
:...open > 0.3106603: 2 (518)
:   open <= 0.3106603:
:   :...high <= 0.2916803: 1 (2)
:       high > 0.2916803: 2 (23)
low <= 0.2960392:
:...high <= 0.2892354: 1 (302)
    high > 0.2892354:
    :...open <= 0.278852: 2 (2)
        open > 0.278852:
        :...high <= 0.3075281: 1 (22/4)
            high > 0.3075281: 2 (2)


Evaluation on training data (871 cases):

        Decision Tree   
      ----------------  
      Size      Errors  

         7    4( 0.5%)   <<


       (a)   (b)    <-classified as
      ----  ----
       322          (a): class 1
         4   545    (b): class 2


    Attribute usage:

    100.00% low
     65.33% open
     40.53% high


Time: 0.0 secs
plot(CloseTree)

# 5.Use the constructed model to predict the class labels of test data:
testPred <- predict(CloseTree, newdata = testData)
result<-table(testPred, testData$close)
result
        
testPred   1   2
       1  73   0
       2   1 151
# Evaluate the model and create confusion matrix
install.packages("caret")
install.packages('e1071', dependencies=TRUE)
library(e1071)
library(caret)

co_result <- confusionMatrix(result)

print(co_result)
Confusion Matrix and Statistics

        
testPred   1   2
       1  73   0
       2   1 151
                                          
               Accuracy : 0.9956          
                 95% CI : (0.9755, 0.9999)
    No Information Rate : 0.6711          
    P-Value [Acc > NIR] : <2e-16          
                                          
                  Kappa : 0.9899          
                                          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.9865          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.9934          
             Prevalence : 0.3289          
         Detection Rate : 0.3244          
   Detection Prevalence : 0.3244          
      Balanced Accuracy : 0.9932          
                                          
       'Positive' Class : 1               
                                          
sensitivity(as.table(co_result))
[1] 0.9864865
specificity(as.table(co_result))
[1] 1
precision(as.table(co_result))
[1] 1
acc <- co_result$overall["Accuracy"]
acc
 Accuracy 
0.9955556 

After applying all three splits, we noticed that for IG and the Gini Index (CART):

the Training (70%) and Testing (30%) split has sensitivity = 0.9959016, specificity = 0.9685039, and accuracy = 0.9865229;

the Training (60%) and Testing (40%) split has sensitivity = 0.9969512, specificity = 0.9710983, and accuracy = 0.988024;

the Training (80%) and Testing (20%) split has sensitivity = 0.9940476, specificity = 0.9655172, and accuracy = 0.9843137.

For the gain ratio, the Training (70%) and Testing (30%) split has sensitivity = 0.9821429, specificity = 0.9953052, accuracy = 0.9907692, and precision = 0.990991;

the Training (60%) and Testing (40%) split has sensitivity = 0.9931973, specificity = 0.9863946, accuracy = 0.9886621, and precision = 0.9733333;

the Training (80%) and Testing (20%) split has sensitivity = 0.9864865, specificity = 1, accuracy = 0.9955556, and precision = 1.

This means the best split for our dataset is Training (60%) and Testing (40%): for IG and CART it gives the highest sensitivity (0.9969512, 99.7%), specificity (0.9710983, 97.1%), and accuracy (0.988024, 98.8%), and for the gain ratio it gives sensitivity = 0.9931973, specificity = 0.9863946, accuracy = 0.9886621, and precision = 0.9733333.
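All of the figures quoted in this comparison are derived from 2x2 confusion tables like the ones printed by confusionMatrix(). As a sanity check, here is a short Python sketch recomputing them from the gain-ratio 60-40 table (146, 4 / 1, 290), with class 1 as the positive class:

```python
# 2x2 confusion table for the gain-ratio 60-40 split
# (rows = predicted class, columns = actual class; class 1 is positive)
tp = 146  # predicted 1, actually 1
fp = 4    # predicted 1, actually 2
fn = 1    # predicted 2, actually 1
tn = 290  # predicted 2, actually 2

sensitivity = tp / (tp + fn)                   # recall on the positive class
specificity = tn / (tn + fp)                   # recall on the negative class
precision   = tp / (tp + fp)                   # Pos Pred Value
accuracy    = (tp + tn) / (tp + fp + fn + tn)

print(round(sensitivity, 7), round(specificity, 7),
      round(precision, 7), round(accuracy, 7))
```

The four printed values (0.9931973, 0.9863946, 0.9733333, 0.9886621) match the sensitivity, specificity, precision, and accuracy reported for that split.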

Clustering is unsupervised learning: it does not use a class label to build the clusters. To implement the clusters, we used the k-means algorithm, which produces K clusters, each represented by the center point of the cluster. It assigns each object to the nearest cluster center, then iteratively recalculates the centers and reassigns the objects until the center of each cluster no longer changes, which means each object is in the right cluster.
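The assign-recalculate loop described above can be written out directly; this is a toy Python sketch of Lloyd's k-means on hypothetical 2-D points (an illustration only, not the kmeans() call used in our R session):

```python
import math
import random

def kmeans(points, k, iters=100, seed=8953):
    """Minimal Lloyd's k-means on 2-D points (toy sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        # update step: recompute each center as the mean of its cluster
        new_centers = [
            tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:               # centers stopped moving
            break
        centers = new_centers
    return centers, clusters

# two well-separated hypothetical blobs
pts = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.0), (5.0, 5.1), (5.2, 4.9), (5.1, 5.0)]
centers, clusters = kmeans(pts, 2)
```

On this toy data the loop settles with each blob in its own cluster, which is exactly the "centers no longer change" stopping condition described above.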

The factoextra package is used to help implement the clustering technique. The scale() function is used for scaling and centering the dataset's attributes; kmeans() finds a specified number of clusters; fviz_cluster() visualizes the cluster diagram; silhouette() calculates the average silhouette width for each cluster and fviz_silhouette() visualizes it; and fviz_nbclust() compares three different numbers of clusters to find the optimal number, evaluating how well the clusters are separated and how compact they are. In both techniques, we used set.seed() with the same random number each time we tried a different size, to ensure that we get the same result each run.
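The silhouette width that silhouette() and fviz_nbclust() rely on is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from point i to the other points of its own cluster and b(i) is the smallest mean distance to the points of another cluster. A toy Python sketch on hypothetical 1-D data (not our dataset):

```python
def silhouette_avg(clusters):
    """Average silhouette width for clusters of 1-D points.
    Each cluster must contain at least two points."""
    def dist(x, y):
        return abs(x - y)
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            # a(i): mean distance to the other points in p's own cluster
            others = [q for q in cluster if q is not p]
            a = sum(dist(p, q) for q in others) / len(others)
            # b(i): smallest mean distance to the points of another cluster
            b = min(
                sum(dist(p, q) for q in other) / len(other)
                for cj, other in enumerate(clusters) if cj != ci
            )
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# compact, well-separated clusters give a silhouette close to 1
good = silhouette_avg([[1.0, 1.1, 0.9], [10.0, 10.2, 9.8]])
# overlapping clusters give a much lower silhouette
bad  = silhouette_avg([[1.0, 2.0, 3.0], [2.5, 3.5, 4.5]])
```

Compact, well-separated clusters push the average toward 1, while overlapping clusters pull it toward 0; this is the quantity that fviz_nbclust() compares across candidate numbers of clusters.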

Data types should be transformed into numeric types before clustering.

# preprocessing
#Data types should be transformed into numeric types before clustering.
dataset<-dataset[,3:11]
dataset <- scale(dataset)
View(dataset)
# k-means clustering to find 4 clusters 
#set a seed for random number generation  to make the results reproducible
set.seed(8953)
kmeans.result <- kmeans(dataset, 4)

Visualizing the 4 clusters:

# visualize clustering
#install.packages("factoextra")
library(factoextra)
fviz_cluster(kmeans.result, data = dataset)